Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias by elwhyjay · Pull Request #28131 · microsoft/onnxruntime

elwhyjay · 2026-04-19T07:20:41Z

Description

Add an alpha == 1.0 check to GemmNodeGroupSelector::Check (symmetric to the existing beta == 1.0 check) so the QDQ Gemm → QGemm fusion is skipped when a bias is present and alpha != 1. Extend the QDQ Gemm transformer tests (both the default and fastmath variants) with alpha_not_one cases.

Motivation and Context

Fixes #28130.

When a Gemm has alpha != 1 and a bias, the QDQ fusion produces incorrect results. The root cause is in the QGemm kernel:

acc_int32 = (A - zpA)(B - zpB) + C_int        // bias added to int accumulator
Y_float   = (alpha * sa * sb) * acc_int32      // output scale applied to everything

The output scale (alpha * a_scale * b_scale) is applied to the whole int32 accumulator, which already contains the bias, so alpha ends up multiplying the bias too:

Y = alpha * A_dq * B_dq + alpha * C_float      (buggy)
Y = alpha * A_dq * B_dq + C_float              (expected)

This matches the reporter's observation that the discrepancy vanishes when the bias is zero.

This PR is the minimal correctness fix — it rejects the fusion in the broken case rather than producing wrong output. The follow-up discussed on the issue is to absorb alpha into the int32 bias at graph-transform time (C_int_new = round(C_int / alpha) inside GemmReplaceWithQuant), so alpha != 1 cases can keep the fused path. That is a separate PR because the precision/overflow characteristics want their own review.

Test plan

Extend QDQTransformerGemmTests and its fastmath variant with an alpha_not_one parameter. Set alpha=2.0 on the Gemm node when true, and adjust check_binary_op_graph so fusion is expected to be skipped when bias is present.
On main the new cases fail — both on op count and on output (expected -0.624, got -0.429, diff 0.195, tol 0.016), matching the bug report.
With the selector fix, QDQTransformerTests.Gemm_* (8), QDQTransformerTests.* (146), and GraphTransformationTests.Gemm* (16) all pass locally on macOS arm64.
End-to-end with the reporter's gemm_alpha.py against a locally built wheel (macOS arm64) matches optimized vs unoptimized across all cases:

    alpha=1.0, optimize=False     [[ 39 221 173]]
    alpha=1.0, optimize=True      [[ 39 221 173]]
    alpha=2.0, optimize=False     [[ 47 216 169]]
    alpha=2.0, optimize=True      [[ 47 216 169]]   (was [[0 255 218]] on main)
    alpha=2.0, optimize=False, bias=0   [[144 118 120]]
    alpha=2.0, optimize=True,  bias=0   [[144 118 120]]

elwhyjay · 2026-04-27T09:15:50Z

Friendly ping on this one. @tianleiwu
Most CI is green; the only failure I see is windows_x64_asan, which appears to be an AddressSanitizer OOM in unrelated tests. Happy to rebase/update if preferred.
and would appreciate any guidance on whether this minimal fix is acceptable, or whether maintainers would prefer folding the follow-up bias-rescaling approach into this PR.

Copilot

Pull request overview

This PR fixes an incorrect QDQ Gemm → QGemm fusion case by preventing fusion when a Gemm bias is present and alpha != 1, and extends the existing QDQ Gemm transformer tests to cover alpha != 1 scenarios (including fastmath).

Changes:

Update GemmNodeGroupSelector::Check to reject QDQ Gemm→QGemm fusion for bias-present Gemm when alpha != 1.
Add alpha_not_one coverage to QDQTransformerGemmTests in both the standard and fastmath test suites.
Update test graph expectations so fusion is skipped specifically for the bias + alpha_not_one combination.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
`onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc`	Adds an `alpha == 1` requirement (when bias is present) to avoid incorrect bias scaling in QGemm fusion.
`onnxruntime/test/optimizer/qdq_transformer_test.cc`	Extends QDQ Gemm tests with `alpha_not_one` cases and updates fusion expectations accordingly.
`onnxruntime/test/optimizer/qdq_transformer_fastmath_test.cc`	Mirrors the same `alpha_not_one` test coverage for the fastmath variant.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

elwhyjay · 2026-05-07T00:24:28Z

Hi @tianleiwu, this PR is blocked from auto-merge by a single failing check (windows_x64_asan / build_x64).Could you re-run the ASAN job? Thanks!

tianleiwu · 2026-05-07T05:19:50Z

Hi @tianleiwu, this PR is blocked from auto-merge by a single failing check (windows_x64_asan / build_x64).Could you re-run the ASAN job? Thanks!

Done

elwhyjay · 2026-05-12T13:00:24Z

Gentle ping - @tianleiwu. this PR is still blocked by a same single check. Could yo re-run this job again?

tianleiwu · 2026-05-12T20:06:52Z

Gentle ping - @tianleiwu. this PR is still blocked by a same single check. Could yo re-run this job again?

Job re-tried. Hope that it can pass this time.

elwhyjay · 2026-05-12T21:13:28Z

@tianleiwu Failed again😅. I think there is a st problem with it 🤔 .

tianleiwu · 2026-05-13T01:42:07Z

@tianleiwu Failed again😅. I think there is a st problem with it 🤔 .

Last retry (failed 5 times). If it does not work, please merge/rebase main.

The Gemm→QGemm QDQ fusion selector only validated beta == 1, letting Gemms with alpha != 1 and a bias through. QGemm broadcasts the int32 bias into the accumulator before applying the alpha*sa*sb output scale, so the bias ends up scaled by alpha too — producing incorrect outputs when alpha != 1 (bias == 0 masks the issue). Add an alpha == 1 check alongside the existing beta == 1 check in GemmNodeGroupSelector::Check (only when bias is present — without bias the fused path is still exact). Extend QDQTransformerGemmTests and the fastmath variant with an alpha_not_one parameter so the regression is covered. Follow-up tracked in the issue: absorb alpha into the int32 bias in GemmReplaceWithQuant so alpha != 1 cases can keep the fusion.

elwhyjay · 2026-05-13T02:28:38Z

@tianleiwu Sadly failed at the same point. As you suggested, rebased onto current main. Sory for the trouble

…sure

elwhyjay · 2026-05-19T05:48:29Z

Hi @tianleiwu, sorry to keep bothering you with this. After rebasing to the main, the CI still failed at the same point with the same ASAN OOM, so I suspect the added test coverage might be pushing it over the memory limit. I've reduced the new cases down to the minimal essential ones (keeping the bias + alpha != 1 regression checks). Could you trigger CI once more when you have a moment?

tianleiwu · 2026-05-21T06:05:24Z

@elwhyjay,

Here is AI analysis of the failed CI job:

The failure is most likely caused by TransposeOptimizerTests.TestDequantizeLinearNoAxis, which crashes under ASan with OOM immediately after starting:

Failing test in logs: TransposeOptimizerTests.TestDequantizeLinearNoAxis
Failure: AddressSanitizer: Out of memory. The process has exhausted 8192MB
Failing file: onnxruntime/test/optimizer/transpose_optimizer_test.cc at the test definition for TestDequantizeLinearNoAxis and its helper RunDequantizeLinearTestCase:
- helper: RunDequantizeLinearTestCase at onnxruntime/test/optimizer/transpose_optimizer_test.cc:3669
- failing test: onnxruntime/test/optimizer/transpose_optimizer_test.cc:3760
- source: https://github.com/microsoft/onnxruntime/blob/bf7bc2d094eb19c144cf59168b2f489d45c79f56/onnxruntime/test/optimizer/transpose_optimizer_test.cc
Job definition running ASan build+tests: .github/workflows/windows_build_x64_asan.yml:41
- source: https://github.com/microsoft/onnxruntime/blob/bf7bc2d094eb19c144cf59168b2f489d45c79f56/.github/workflows/windows_build_x64_asan.yml

Root cause

TestDequantizeLinearNoAxis invokes RunDequantizeLinearTestCase for 6 variants:

ONNX domain: uint8_t, uint16_t, int16_t
MS domain: uint8_t, uint16_t, int16_t

In RunDequantizeLinearTestCase, all variants use TransformerTester(...) across multiple opsets:

std::vector<int> opsets = {15, 18, 21};
if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
  if (q_domain == kOnnxDomain) {
    opsets = std::vector<int>{21};
  }
}

So this single test expands into many ASan-heavy session builds. Since the log shows the crash happens exactly when TestDequantizeLinearNoAxis starts, the issue is test memory pressure, not a product correctness failure.

Best fix

Reduce the matrix for this specific no-axis test under ASan, especially for the 16-bit variants and/or MS-domain duplicates, because the behavior being validated is already covered by neighboring tests:

TestDequantizeLinearScalarIgnoreAxis
TestDequantizeLinearVector

A focused change is to keep one representative no-axis case and skip the redundant high-memory variants in ASan builds.

Suggested code change

Update TestDequantizeLinearNoAxis to limit coverage under ASan:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
  std::vector<int64_t> zp_value_shape{};
  std::optional<ONNX_NAMESPACE::AttributeProto> no_axis;  // Empty axis value will not be set.

  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);

#if !defined(__SANITIZE_ADDRESS__)
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}

More portable version for MSVC ASan

Because compiler macros differ on Windows, a safer pattern is to add a small local guard:

#if defined(__SANITIZE_ADDRESS__) || defined(ADDRESS_SANITIZER) || defined(_MSC_VER)
#define ORT_ASAN_BUILD 1
#endif

Then:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
  std::vector<int64_t> zp_value_shape{};
  std::optional<ONNX_NAMESPACE::AttributeProto> no_axis;

  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);

#if !defined(ORT_ASAN_BUILD)
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}

Preferred cleaner fix

If you want to preserve all coverage outside ASan without sprinkling guards in tests, add an optional opsets override or reduced-coverage mode to the helper and use it only here. Example:

template <typename QuantType>
static void RunDequantizeLinearTestCase(const std::optional<std::vector<int64_t>>& zp_input_shape,
                                        const std::vector<int64_t>& zp_value_shape,
                                        std::optional<ONNX_NAMESPACE::AttributeProto> axis,
                                        const std::string& q_domain = "",
                                        std::vector<int> opsets_override = {}) {
  ...
  std::vector<int> opsets = opsets_override.empty() ? std::vector<int>{15, 18, 21} : std::move(opsets_override);

  if (opsets_override.empty()) {
    if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
      if (q_domain == kOnnxDomain) {
        opsets = std::vector<int>{21};
      }
    }
  }

  TransformerTester(..., opsets);
}

Then:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  ...
#if defined(ORT_ASAN_BUILD)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain, {21});
#else
  ...
#endif
}

Why this is the right solution

The workflow is an ASan job (windows_build_x64_asan), and the log shows the process dies from memory exhaustion rather than a semantic assertion failure. The fix should therefore reduce peak memory from this test’s combinatorial coverage, not change runtime code.

…_not_one cases

elwhyjay · 2026-05-21T06:57:51Z

Hi @tianleiwu, thank you for the analysis! I went with option 1 (__SANITIZE_ADDRESS__ guard) because adding _MSC_VER would also disable the variants on non-ASan Windows builds, and MSVC 16.9+ defines __SANITIZE_ADDRESS__ under /fsanitize=address anyway. Also restored the alpha_not_one cases I had trimmed earlier.

Option 3 changes the helper signature and other call sites, so I left it out of this PR - but I can add it here if you'd prefer, or send it as a small follow-up. Thank you again!

elwhyjay · 2026-05-22T05:31:46Z

Hi @tianleiwu, the rerun failed again with the same ASan OOM, but the failing test was TransposeOptimizerTests.TestReduceMaxKeepdimsFalse (line 1265), which runs before TestDequantizeLinearNoAxis (line 3760), so the guard never gets a chance to trigger. Looking at the previous failures too, the OOM consistently happens somewhere in the TransposeOptimizerTests block (lines ~1200–1700).
So it looks like the cumulative memory pressure builds up before reaching the guard. Could you advise on how you'd like to proceed?

tianleiwu · 2026-05-26T16:15:59Z

Hi @tianleiwu, the rerun failed again with the same ASan OOM, but the failing test was TransposeOptimizerTests.TestReduceMaxKeepdimsFalse (line 1265), which runs before TestDequantizeLinearNoAxis (line 3760), so the guard never gets a chance to trigger. Looking at the previous failures too, the OOM consistently happens somewhere in the TransposeOptimizerTests block (lines ~1200–1700). So it looks like the cumulative memory pressure builds up before reaching the guard. Could you advise on how you'd like to proceed?

@elwhyjay, The solution is either reduce memory usage of the test case, or reduce test parallel in ASan to avoid OOM.

Here is a PR for the later: #28675. You can merge it to your branch and test it.

…28130

elwhyjay · 2026-05-27T01:46:46Z

Hi @tianleiwu, thank you for #28675! I've merged the latest main (which now includes your --test_parallel 4 change) and also reverted the ASan guard I had added to transpose_optimizer_test.cc since it's no longer needed. Could you trigger CI when you have a moment?

elwhyjay mentioned this pull request Apr 19, 2026

Optimization breaks QDQ-quantized Gemm graph when alpha != 1 #28130

Open

tianleiwu requested a review from Copilot April 28, 2026 16:42

Copilot started reviewing on behalf of tianleiwu April 28, 2026 16:42 View session

Copilot AI reviewed Apr 28, 2026

View reviewed changes

tianleiwu previously approved these changes May 1, 2026

View reviewed changes

tianleiwu enabled auto-merge (squash) May 1, 2026 18:25

auto-merge was automatically disabled May 13, 2026 02:26
Head branch was pushed to by a user without write access

elwhyjay force-pushed the fix/qdq-gemm-alpha-28130 branch from 184d71d to e8c90e9 Compare May 13, 2026 02:26

Trim redundant no-bias alpha_not_one cases to reduce ASAN memory pres…

f983383

…sure

elwhyjay dismissed tianleiwu’s stale review via f983383 May 19, 2026 05:48

Merge branch 'microsoft:main' into fix/qdq-gemm-alpha-28130

bf7bc2d

Skip TestDequantizeLinearNoAxis variants under ASan and restore alpha…

f031046

…_not_one cases

tianleiwu enabled auto-merge (squash) May 22, 2026 04:07

tianleiwu previously approved these changes May 22, 2026

View reviewed changes

elwhyjay added 2 commits May 27, 2026 10:45

Revert ASan guard

c6a9c6c

Merge remote-tracking branch 'upstream/main' into fix/qdq-gemm-alpha-…

12b7ec6

…28130

auto-merge was automatically disabled May 27, 2026 01:45
Head branch was pushed to by a user without write access

elwhyjay dismissed tianleiwu’s stale review via 12b7ec6 May 27, 2026 01:45

Conversation

elwhyjay commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Test plan

Uh oh!

elwhyjay commented Apr 27, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

elwhyjay commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tianleiwu commented May 7, 2026

Uh oh!

elwhyjay commented May 12, 2026

Uh oh!

tianleiwu commented May 12, 2026

Uh oh!

elwhyjay commented May 12, 2026

Uh oh!

tianleiwu commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elwhyjay commented May 13, 2026

Uh oh!

elwhyjay commented May 19, 2026

Uh oh!

tianleiwu commented May 21, 2026

Here is AI analysis of the failed CI job:

Root cause

Best fix

Suggested code change

More portable version for MSVC ASan

Preferred cleaner fix

Why this is the right solution

Uh oh!

elwhyjay commented May 21, 2026

Uh oh!

elwhyjay commented May 22, 2026

Uh oh!

tianleiwu commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elwhyjay commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

elwhyjay commented Apr 19, 2026 •

edited

Loading

elwhyjay commented May 7, 2026 •

edited

Loading

tianleiwu commented May 13, 2026 •

edited

Loading

tianleiwu commented May 26, 2026 •

edited

Loading